{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recommender Systems 2020/21\n",
    "\n",
    "### Practice - Implicit Alternating Least Squares\n",
    "\n",
    "See:\n",
    "Y. Hu, Y. Koren and C. Volinsky, Collaborative filtering for implicit feedback datasets, ICDM 2008.\n",
    "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.5120&rep=rep1&type=pdf\n",
    "\n",
    "R. Pan et al., One-class collaborative filtering, ICDM 2008.\n",
    "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.306.4684&rep=rep1&type=pdf\n",
    "\n",
    "Factorization model for binary feedback.\n",
    "First, splits the feedback matrix R as the element-wise a Preference matrix P and a Confidence matrix C.\n",
    "Then computes the decomposition of them into the dot product of two matrices X and Y of latent factors.\n",
    "X represent the user latent factors, Y the item latent factors.\n",
    "\n",
    "The model is learned by solving the following regularized Least-squares objective function with Stochastic Gradient Descent\n",
    "    \n",
    "$$\\frac{1}{2}\\sum_{i,j}{c_{ij}\\left(p_{ij}-x_i^T y_j\\right) + \\lambda\\left(\\sum_{i}{||x_i||^2} + \\sum_{j}{||y_j||^2}\\right)}$$\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Movielens10M: Verifying data consistency...\n",
      "Movielens10M: Verifying data consistency... Passed!\n",
      "DataReader: current dataset is: <class 'Data_manager.Dataset.Dataset'>\n",
      "\tNumber of items: 10681\n",
      "\tNumber of users: 69878\n",
      "\tNumber of interactions in URM_all: 10000054\n",
      "\tValue range in URM_all: 0.50-5.00\n",
      "\tInteraction density: 1.34E-02\n",
      "\tInteractions per user:\n",
      "\t\t Min: 2.00E+01\n",
      "\t\t Avg: 1.43E+02\n",
      "\t\t Max: 7.36E+03\n",
      "\tInteractions per item:\n",
      "\t\t Min: 0.00E+00\n",
      "\t\t Avg: 9.36E+02\n",
      "\t\t Max: 3.49E+04\n",
      "\tGini Index: 0.57\n",
      "\n",
      "\tICM name: ICM_genres, Value range: 1.00 / 1.00, Num features: 20, feature occurrences: 21564, density 1.01E-01\n",
      "\tICM name: ICM_tags, Value range: 1.00 / 69.00, Num features: 10217, feature occurrences: 108563, density 9.95E-04\n",
      "\tICM name: ICM_all, Value range: 1.00 / 69.00, Num features: 10237, feature occurrences: 130127, density 1.19E-03\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from Notebooks_utils.data_splitter import train_test_holdout\n",
    "from Data_manager.Movielens.Movielens10MReader import Movielens10MReader\n",
    "\n",
    "data_reader = Movielens10MReader()\n",
    "data_loaded = data_reader.load_data()\n",
    "\n",
    "URM_all = data_loaded.get_URM_all()\n",
    "\n",
    "URM_train, URM_test = train_test_holdout(URM_all, train_perc = 0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<69878x10681 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 8001405 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "URM_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What do we need for IALS?\n",
    "\n",
    "* User factor and Item factor matrices\n",
    "* Confidence function\n",
    "* Update rule for items\n",
    "* Update rule for users\n",
    "* Training loop and some patience\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_users, n_items = URM_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: We create the dense latent factor matrices\n",
    "### In a MF model you have two matrices, one with a row per user and the other with a column per item. The other dimension, columns for the first one and rows for the second one is called latent factors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_factors = 10\n",
    "\n",
    "user_factors = np.random.random((n_users, num_factors))\n",
    "item_factors = np.random.random((n_items, num_factors))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.56445343, 0.01627608, 0.17512001, ..., 0.87661018, 0.3899033 ,\n",
       "        0.32641433],\n",
       "       [0.1088799 , 0.19870041, 0.10541365, ..., 0.20993058, 0.02079788,\n",
       "        0.08099792],\n",
       "       [0.13317211, 0.50273745, 0.43626129, ..., 0.42791938, 0.82443448,\n",
       "        0.38202853],\n",
       "       ...,\n",
       "       [0.10204191, 0.85591125, 0.20562087, ..., 0.03072743, 0.47650508,\n",
       "        0.31997839],\n",
       "       [0.63161141, 0.30725951, 0.98158763, ..., 0.42191041, 0.93405159,\n",
       "        0.74128234],\n",
       "       [0.36176751, 0.02773363, 0.91659522, ..., 0.45599821, 0.95870616,\n",
       "        0.85224132]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_factors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0.96778511, 0.03326854, 0.42771504, ..., 0.24565161, 0.29795505,\n",
       "        0.82082031],\n",
       "       [0.82495307, 0.34789801, 0.93389078, ..., 0.10382001, 0.15947663,\n",
       "        0.80584883],\n",
       "       [0.16349275, 0.07321222, 0.33470242, ..., 0.35797385, 0.77474128,\n",
       "        0.01197913],\n",
       "       ...,\n",
       "       [0.04299959, 0.73249113, 0.17246724, ..., 0.51621872, 0.67975724,\n",
       "        0.53853707],\n",
       "       [0.3429618 , 0.98190495, 0.6485606 , ..., 0.81386613, 0.669452  ,\n",
       "        0.56823051],\n",
       "       [0.47554082, 0.39055113, 0.37510228, ..., 0.999072  , 0.22196863,\n",
       "        0.4787177 ]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "item_factors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: We define a function to transform the interaction data in a \"confidence\" value. \n",
    "* If you have explicit data, the higher it is the higher the confidence (logarithmic, linear?)\n",
    "* Other options include scaling the data lowering it if the item or use has very few interactions (lower support)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def linear_confidence_function(URM_train, alpha):\n",
    "    \n",
    "    URM_train.data = 1.0 + alpha*URM_train.data\n",
    "    \n",
    "    return URM_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 3.5])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alpha = 0.5\n",
    "C_URM_train = linear_confidence_function(URM_train, alpha)\n",
    "\n",
    "C_URM_train.data[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Define the update rules for the user factors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Update latent factors for a single user or item.\n",
    "\n",
    "Y = |n_interactions|x|n_factors|\n",
    "\n",
    "YtY =   |n_factors|x|n_factors|\n",
    "\n",
    "\n",
    "\n",
    "Latent factors ony of item/users for which an interaction exists in the interaction profile\n",
    "Y_interactions = Y[interaction_profile, :]\n",
    "\n",
    "Following the notation of the original paper we report the update rule for the Item factors (User factors are identical):\n",
    "* __Y__ are the item factors |n_items|x|n_factors|\n",
    "* __Cu__ is a diagonal matrix |n_interactions|x|n_interactions| with the user confidence for the observed items\n",
    "* __p(u)__ is a boolean vectors indexing only observed items. Here it will disappear as we already extract only the observed latent factors however, it will have an impact in the dimensions of the matrix, since it transforms Cu from a diagonal matrix to a row vector of 1 row and |n_interactions| columns\n",
    "\n",
    "$$(Yt*Cu*Y + reg*I)^-1 * Yt*Cu*profile$$ which can be decomposed as $$(YtY + Yt*(Cu-I)*Y + reg*I)^-1 * Yt*Cu*p(u)$$ \n",
    "\n",
    "* __A__ = (|n_interactions|x|n_factors|) dot (|n_interactions|x|n_interactions| ) dot (|n_interactions|x|n_factors| )\n",
    "  = |n_factors|x|n_factors|\n",
    "  \n",
    "We use an equivalent formulation (v * k.T).T which is much faster\n",
    "* __A__ = Y_interactions.T.dot(((interaction_confidence - 1) * Y_interactions.T).T)\n",
    "* __B__ = YtY + A + self.regularization_diagonal\n",
    "* __new factors__ = np.dot(np.linalg.inv(B), Y_interactions.T.dot(interaction_confidence))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def _update_row(interaction_profile, interaction_confidence, Y, YtY, regularization_diagonal):\n",
    "\n",
    "    Y_interactions = Y[interaction_profile, :]\n",
    "    \n",
    "    A = Y_interactions.T.dot(((interaction_confidence - 1) * Y_interactions.T).T)\n",
    "\n",
    "    B = YtY + A + regularization_diagonal\n",
    "\n",
    "    return np.dot(np.linalg.inv(B), Y_interactions.T.dot(interaction_confidence))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "regularization_coefficient = 1e-4\n",
    "\n",
    "regularization_diagonal = np.diag(regularization_coefficient * np.ones(num_factors))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10, 10)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# VV = n_factors x n_factors\n",
    "VV = item_factors.T.dot(item_factors)\n",
    "VV.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_id = 154"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_pos = C_URM_train.indptr[user_id]\n",
    "end_pos = C_URM_train.indptr[user_id + 1]\n",
    "\n",
    "user_profile = C_URM_train.indices[start_pos:end_pos]\n",
    "user_confidence = C_URM_train.data[start_pos:end_pos]\n",
    "\n",
    "user_factors[user_id, :] = _update_row(user_profile, user_confidence, item_factors, VV, regularization_diagonal)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Apply updates on the user item factors as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10, 10)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# UU = n_factors x n_factors\n",
    "UU = user_factors.T.dot(user_factors)\n",
    "UU.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "item_id = 154"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "C_URM_train_csc = C_URM_train.tocsc()\n",
    "\n",
    "start_pos = C_URM_train_csc.indptr[item_id]\n",
    "end_pos = C_URM_train_csc.indptr[item_id + 1]\n",
    "\n",
    "item_profile = C_URM_train_csc.indices[start_pos:end_pos]\n",
    "item_confidence = C_URM_train_csc.data[start_pos:end_pos]\n",
    "\n",
    "item_factors[item_id, :] = _update_row(item_profile, item_confidence, user_factors, UU, regularization_diagonal)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's put all together in a training loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration 69878 in 9.08 seconds. Users per second 7695.57\n",
      "Iteration 10681 in 11.38 seconds. Items per second 938.49\n",
      "Epoch 1 complete in in 11.39 seconds\n",
      "Iteration 69878 in 8.96 seconds. Users per second 7798.79\n",
      "Iteration 10681 in 11.30 seconds. Items per second 945.14\n",
      "Epoch 2 complete in in 11.30 seconds\n",
      "Iteration 69878 in 9.18 seconds. Users per second 7611.73\n",
      "Iteration 10681 in 11.41 seconds. Items per second 936.01\n",
      "Epoch 3 complete in in 11.41 seconds\n",
      "Iteration 69878 in 9.48 seconds. Users per second 7371.05\n",
      "Iteration 10681 in 11.87 seconds. Items per second 899.77\n",
      "Epoch 4 complete in in 11.87 seconds\n",
      "Iteration 69878 in 9.14 seconds. Users per second 7644.97\n",
      "Iteration 10681 in 12.09 seconds. Items per second 883.35\n",
      "Epoch 5 complete in in 12.10 seconds\n",
      "Iteration 69878 in 9.49 seconds. Users per second 7363.08\n",
      "Iteration 10681 in 11.79 seconds. Items per second 905.84\n",
      "Epoch 6 complete in in 11.80 seconds\n",
      "Iteration 69878 in 9.38 seconds. Users per second 7449.51\n",
      "Iteration 10681 in 11.83 seconds. Items per second 902.78\n",
      "Epoch 7 complete in in 11.83 seconds\n",
      "Iteration 69878 in 9.31 seconds. Users per second 7505.56\n",
      "Iteration 10681 in 11.65 seconds. Items per second 916.54\n",
      "Epoch 8 complete in in 11.65 seconds\n",
      "Iteration 69878 in 9.60 seconds. Users per second 7280.70\n",
      "Iteration 10681 in 11.91 seconds. Items per second 896.91\n",
      "Epoch 9 complete in in 11.91 seconds\n",
      "Iteration 69878 in 9.21 seconds. Users per second 7587.12\n",
      "Iteration 10681 in 11.50 seconds. Items per second 928.70\n",
      "Epoch 10 complete in in 11.50 seconds\n"
     ]
    }
   ],
   "source": [
    "C_URM_train_csc = C_URM_train.tocsc()\n",
    "\n",
    "num_factors = 10\n",
    "\n",
    "user_factors = np.random.random((n_users, num_factors))\n",
    "item_factors = np.random.random((n_items, num_factors))\n",
    "\n",
    "\n",
    "for n_epoch in range(10):\n",
    "    \n",
    "    start_time = time.time()\n",
    "\n",
    "    for user_id in range(C_URM_train.shape[0]):\n",
    "\n",
    "        start_pos = C_URM_train.indptr[user_id]\n",
    "        end_pos = C_URM_train.indptr[user_id + 1]\n",
    "\n",
    "        user_profile = C_URM_train.indices[start_pos:end_pos]\n",
    "        user_confidence = C_URM_train.data[start_pos:end_pos]\n",
    "\n",
    "        user_factors[user_id, :] = _update_row(user_profile, user_confidence, item_factors, VV, regularization_diagonal)   \n",
    "\n",
    "        # Print some stats\n",
    "        if (user_id +1)% 100000 == 0 or user_id == C_URM_train.shape[0]-1:\n",
    "            elapsed_time = time.time() - start_time\n",
    "            samples_per_second = user_id/elapsed_time\n",
    "            print(\"Iteration {} in {:.2f} seconds. Users per second {:.2f}\".format(user_id+1, elapsed_time, samples_per_second))\n",
    "\n",
    "\n",
    "    for item_id in range(C_URM_train.shape[1]):\n",
    "\n",
    "        start_pos = C_URM_train_csc.indptr[item_id]\n",
    "        end_pos = C_URM_train_csc.indptr[item_id + 1]\n",
    "\n",
    "        item_profile = C_URM_train_csc.indices[start_pos:end_pos]\n",
    "        item_confidence = C_URM_train_csc.data[start_pos:end_pos]\n",
    "\n",
    "        item_factors[item_id, :] = _update_row(item_profile, item_confidence, user_factors, UU, regularization_diagonal)    \n",
    "\n",
    "        # Print some stats\n",
    "        if (item_id +1)% 100000 == 0 or item_id == C_URM_train.shape[1]-1:\n",
    "            elapsed_time = time.time() - start_time\n",
    "            samples_per_second = item_id/elapsed_time\n",
    "            print(\"Iteration {} in {:.2f} seconds. Items per second {:.2f}\".format(item_id+1, elapsed_time, samples_per_second))\n",
    "\n",
    "    total_epoch_time = time.time() - start_time  \n",
    "    print(\"Epoch {} complete in in {:.2f} seconds\".format(n_epoch+1, total_epoch_time))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How long do we train such a model?\n",
    "\n",
    "* An epoch: a complete loop over all the train data\n",
    "* Usually you train for multiple epochs. Depending on the algorithm and data 10s or 100s of epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimated time with the previous training speed is 115.00 seconds, or 1.92 minutes\n"
     ]
    }
   ],
   "source": [
    "estimated_seconds = total_epoch_time*10\n",
    "print(\"Estimated time with the previous training speed is {:.2f} seconds, or {:.2f} minutes\".format(estimated_seconds, estimated_seconds/60))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lastly: Computing a prediction for any given user or item"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "user_id = 17025\n",
    "item_id = 468"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6921570260621303"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predicted_rating = np.dot(user_factors[user_id,:], item_factors[item_id,:])\n",
    "predicted_rating"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
